
Fix TEGroupedLinear quantization for expert parallelism (EP > 1)#833

Merged
yueshen2016 merged 1 commit into main from yueshen/fix-te-grouped-linear-ep-quantization
Feb 6, 2026

Conversation


@yueshen2016 yueshen2016 commented Jan 30, 2026

What does this PR do?

Type of change: Bug fix / Compatibility update

Overview:

Fix `te_grouped_quantized_linear_fn` argument parsing for `TEGroupedLinear` quantization when the parallelism configuration results in fewer local experts per GPU.

Problem

TransformerEngine changed the `_GroupedLinear.forward` signature in PR #2377 (released in TE 2.10):

  • Old signature (TE < 2.10): `forward(ctx, inp, m_splits: List[int], use_bias, is_first_microbatch, ...)`
  • New signature (TE >= 2.10): `forward(ctx, inp, non_tensor_args: Tuple, *weights_and_biases)`, where `non_tensor_args = (m_splits, use_bias, is_first_microbatch, ...)`

Without this fix, ModelOpt's quantization code fails with newer TE versions: it reads `m_splits` directly from `args[idx + 1]`, but in TE >= 2.10 that position holds the `non_tensor_args` tuple instead.
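The difference can be illustrated with two simplified mock forwards (illustration only; these are stand-ins, not TransformerEngine's real implementations):

```python
# Illustration only: simplified mocks of the two _GroupedLinear.forward
# signatures, not TransformerEngine's actual autograd functions.

def forward_old(ctx, inp, m_splits, use_bias, *weights):
    # TE < 2.10: m_splits arrives as a plain positional list.
    return len(m_splits)  # num_gemms

def forward_new(ctx, inp, non_tensor_args, *weights_and_biases):
    # TE >= 2.10: all non-tensor args are packed into a single tuple;
    # m_splits is its first element.
    m_splits = non_tensor_args[0]
    return len(m_splits)  # num_gemms

m_splits = [128, 256, 64]
assert forward_old(None, "inp", m_splits, False, "w0", "w1", "w2") == 3
assert forward_new(None, "inp", (m_splits, False), "w0", "b0", "w1", "b1") == 3
```

Code that unconditionally treats the third positional argument as `m_splits` returns the wrong length (or crashes) under the new layout.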

Root Cause

The code assumed `m_splits` was always directly accessible at `args[idx + 1]`, but TransformerEngine PR #2377 changed the signature to pack all non-tensor arguments into a tuple.
Taking Qwen3-30B-A3B (with `num_gemms=21`, threshold=44) as an example:

Solution

Added a version check to handle both signatures:

```python
if Version("2.10") <= _TE_VERSION:
    # New signature: non_tensor_args is a tuple, m_splits is the first element
    num_gemms = len(args[idx + 1][0])
else:
    # Old signature: m_splits is directly args[idx + 1]
    num_gemms = len(args[idx + 1])
```
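A self-contained sketch of the same check, with a plain `(major, minor)` tuple standing in for `packaging.version.Version` so the demo has no dependencies:

```python
# Demo of the version-gated parsing. A (major, minor) tuple replaces
# packaging.version.Version purely to keep the sketch self-contained.

def num_gemms_from_args(args, idx, te_version):
    if te_version >= (2, 10):
        # New signature: args[idx + 1] is the non_tensor_args tuple,
        # whose first element is m_splits.
        return len(args[idx + 1][0])
    # Old signature: args[idx + 1] is m_splits itself.
    return len(args[idx + 1])

m_splits = [128, 256, 64]
old_args = ("inp", m_splits, False)                      # TE < 2.10 layout
new_args = ("inp", (m_splits, False), "w0", "w1", "w2")  # TE >= 2.10 layout

assert num_gemms_from_args(old_args, 0, (2, 9)) == 3
assert num_gemms_from_args(new_args, 0, (2, 10)) == 3
```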

Usage

Works seamlessly with any TransformerEngine version:

```shell
# High EP quantization - previously failed, now works
torchrun --nproc_per_node 8 examples/quantization/quantize.py \
  --hf-model-id /models/Qwen3-30B-A3B \
  --export-quant-cfg fp8 \
  --megatron-save-path /models/Qwen3-30B-A3B_fp8_mlm \
  --tp 8 \
  --ep 8

# High EP inference - previously failed, now works
torchrun --nproc_per_node 8 examples/quantization/ptq_generate.py \
  --megatron-load-path /models/Qwen3-30B-A3B_fp8_mlm \
  --hf-model-id /models/Qwen3-30B-A3B \
  --tp 8 \
  --ep 8
```

Testing

```shell
# High EP quantization - previously failed, now works
torchrun --nproc_per_node 8 examples/quantization/quantize.py \
  --hf-model-id /models/Qwen3-30B-A3B \
  --export-quant-cfg fp8 \
  --megatron-save-path /models/Qwen3-30B-A3B_fp8_mlm \
  --tp 8 \
  --ep 8

# High EP inference - previously failed, now works
torchrun --nproc_per_node 8 examples/quantization/ptq_generate.py \
  --megatron-load-path /models/Qwen3-30B-A3B_fp8_mlm \
  --hf-model-id /models/Qwen3-30B-A3B \
  --tp 8 \
  --ep 8
```

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced Mixture of Experts (MoE) calibration validation and synchronization to ensure consistency across distributed training setups.
    • Improved grouped linear quantization robustness to handle varying input patterns and tensor dimensions.
  • Improvements

    • Better error handling for incomplete MoE expert calibration detection.
    • More flexible argument parsing for quantization operations.


@yueshen2016 yueshen2016 self-assigned this Jan 30, 2026
@yueshen2016 yueshen2016 requested a review from a team as a code owner January 30, 2026 22:09

coderabbitai bot commented Jan 30, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

📝 Walkthrough

Walkthrough

The changes refactor Mixture-of-Experts (MoE) calibration handling in PyTorch quantization across three modules. They add explicit MoE calibration validation and local expert amax synchronization in model_calib.py, remove the specialized _QuantMoELayer class from megatron.py, and improve argument parsing robustness in transformer_engine.py's grouped linear quantization path for varying input configurations.
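The "local expert amax synchronization" idea can be sketched with plain lists (a hypothetical simplification, one plausible reading rather than ModelOpt's actual code; the real implementation would operate on tensors via `torch.distributed.all_reduce` with a MAX op):

```python
# Hypothetical sketch of amax synchronization across ranks. With EP > 1
# each rank calibrates only its local experts, so ranks take the
# elementwise max of their amax observations to agree on one scale per
# slot. Plain lists stand in for tensors and collective communication.

def all_reduce_max(per_rank_amax):
    n_slots = len(per_rank_amax[0])
    synced = [max(rank[i] for rank in per_rank_amax) for i in range(n_slots)]
    # Every rank ends up holding the same synchronized amax values.
    return [list(synced) for _ in per_rank_amax]

ranks = [[0.5, 2.0], [1.5, 0.1]]
assert all_reduce_max(ranks) == [[1.5, 2.0], [1.5, 2.0]]
```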

Changes

  • MoE Calibration Validation & Synchronization (modelopt/torch/quantization/model_calib.py): Introduces _has_expert_parallelism() and _check_moe_calibration_complete() to detect expert parallelism and validate calibration completeness across distributed groups. Adds local expert amax synchronization in max_calibrate() before distributed sync, with validation checks ensuring calibration consistency before proceeding.
  • MoE Layer Quantization Removal (modelopt/torch/quantization/plugins/megatron.py): Removes an unused import and deletes the _QuantMoELayer class, eliminating specialized token-dispatch calibration handling for MoE layers during quantization.
  • Grouped Linear Quantization Robustness (modelopt/torch/quantization/plugins/transformer_engine.py): Reworks TE grouped linear quantization to robustly parse weights and biases from varying argument positions (tail vs. remaining_args). Introduces flexible argument splitting and reconstruction logic to handle both single-partition (EP = 1) and multi-partition (EP > 1) invocation patterns, improving compatibility with different argument list lengths.
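A hypothetical helper (the name and exact convention are invented for illustration) showing the kind of flexible splitting described: the trailing args hold `num_gemms` weights, optionally followed by `num_gemms` biases.

```python
# Hypothetical sketch, not ModelOpt's actual parsing code: split the
# trailing positional args into per-expert weights and optional biases,
# given a known num_gemms.

def split_tail(tail, num_gemms):
    weights = list(tail[:num_gemms])
    if len(tail) >= 2 * num_gemms:
        # Biases present: they follow the weights, one per expert.
        biases = list(tail[num_gemms:2 * num_gemms])
    else:
        # No biases passed: substitute None placeholders.
        biases = [None] * num_gemms
    return weights, biases

assert split_tail(("w0", "w1"), 2) == (["w0", "w1"], [None, None])
assert split_tail(("w0", "w1", "b0", "b1"), 2) == (["w0", "w1"], ["b0", "b1"])
```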

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed

  • Title check: ✅ Passed. The title accurately reflects the primary change: fixing TEGroupedLinear quantization to support expert parallelism when EP > 1, which is the core bug fix addressed across multiple files.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 83.33%, which meets the required threshold of 80.00%.
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.




codecov bot commented Jan 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.72%. Comparing base (452c5a0) to head (197ecda).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main     #833    +/-   ##
========================================
  Coverage   73.72%   73.72%
========================================
  Files         196      197     +1
  Lines       20457    20625   +168
========================================
+ Hits        15082    15206   +124
- Misses       5375     5419    +44
```

☔ View full report in Codecov by Sentry.

@yueshen2016 yueshen2016 force-pushed the yueshen/fix-te-grouped-linear-ep-quantization branch from 0deb9b6 to a85f04e Compare January 30, 2026 23:01
@yueshen2016 yueshen2016 force-pushed the yueshen/fix-te-grouped-linear-ep-quantization branch 4 times, most recently from 53aec4f to febe313 Compare February 6, 2026 07:55
@yueshen2016 yueshen2016 requested a review from realAsma February 6, 2026 07:58
Signed-off-by: James Shen <yueshen@nvidia.com>
@yueshen2016 yueshen2016 force-pushed the yueshen/fix-te-grouped-linear-ep-quantization branch from febe313 to 197ecda Compare February 6, 2026 17:44
@yueshen2016 yueshen2016 enabled auto-merge (squash) February 6, 2026 17:45
@yueshen2016 yueshen2016 merged commit 3393e98 into main Feb 6, 2026
37 checks passed
@yueshen2016 yueshen2016 deleted the yueshen/fix-te-grouped-linear-ep-quantization branch February 6, 2026 19:43
danielkorzekwa pushed a commit that referenced this pull request Feb 17, 2026
sugunav14 pushed a commit that referenced this pull request Feb 17, 2026
kevalmorabia97 pushed a commit that referenced this pull request Feb 20, 2026
danielkorzekwa pushed a commit that referenced this pull request Mar 4, 2026
